{
 "cells": [
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    ".. _email_userguide:\n",
    "\n",
    "Email Addresses\n",
    "==============="
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {
    "raw_mimetype": "text/restructuredtext"
   },
   "source": [
    "Introduction\n",
    "------------\n",
    "\n",
    "The function :func:`clean_email() <dataprep.clean.clean_email.clean_email>` cleans and standardizes a DataFrame column containing email addresses. The function :func:`validate_email() <dataprep.clean.clean_email.validate_email>` validates either a single email address or a column containing email addresses, returning True if the email address is valid and False otherwise.\n",
    "\n",
    "To remove all whitespace from the input value before cleaning it, you can set the parameter ``remove_whitespace`` to True. This will clean an invalid email address like \"hello @example.org\" to \"hello@example.org\".\n",
    "\n",
    "The parameter ``fix_domain`` will try to correct typos in the email address's domain when set to True. The first valid domain found will be returned. It employs four strategies to fix a domain:\n",
    "\n",
    "* Swap neighboring characters. This will fix \"gmali.com\" to \"gmail.com\".\n",
    "* Add a single character. This will fix \"gmal.com\" to \"gmail.com\".\n",
    "* Remove a single character. This will fix \"gmails.com\" to \"gmail.com\".\n",
    "* Swap each character with its nearby keys on the qwerty keyboard. This will fix \"gmqil.com\" to \"gmail.com\".\n",
    "\n",
    "You can split the column of email addresses into one column for the usernames and another for the domains by setting the parameter ``split`` to True.\n",
    "\n",
    "Invalid parsing is handled with the ``errors`` parameter:\n",
    "\n",
    "* \"coerce\" (default): invalid parsing will be set to NaN\n",
    "* \"ignore\": invalid parsing will return the input\n",
    "* \"raise\": invalid parsing will raise an exception\n",
    "\n",
    "After cleaning, a **report** is printed that provides the following information:\n",
    "\n",
    "* How many values were cleaned (the value must have been transformed).\n",
    "* How many values could not be parsed.\n",
    "* A summary of the cleaned data: how many values are in the correct format, and how many values are NaN.\n",
    "  \n",
    "The following sections demonstrate the functionality of ``clean_email()`` and ``validate_email()``. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### An example dataset with email addresses"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "df = pd.DataFrame({\n",
    "    \"email\": [\n",
    "        \"yi@gmali.com\", \"yi@sfu.ca\", \"y i@sfu.ca\", \"Yi@gmail.com\",\n",
    "        \"H ELLO@hotmal.COM\", \"hello\", np.nan, \"NULL\"\n",
    "    ]\n",
    "})\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Default clean_email()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By default, `clean_email()` will do a strict check to determine if an email address is in the correct format and set invalid values to NaN."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprep.clean import clean_email\n",
    "clean_email(df, \"email\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. `split` parameter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By setting the `split` parameter to True, the returned table will contain separate columns for the domain and username of valid emails."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_email(df, \"email\", split=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. `remove_whitespace` parameter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When the `remove_whitespace` parameter is set to True, whitespace will be removed before checking if an email is valid."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_email(df, \"email\", remove_whitespace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. `fix_domain` parameter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When the `fix_domain` parameter is set to True, `clean_email()` will try to correct invalid domains."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_email(df, \"email\", fix_domain=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. `error` parameter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When `errors=\"ignore\"`, invalid emails will be left unchanged in the output"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clean_email(df, \"email\", errors=\"ignore\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. `validate_email()`\n",
    "\n",
    "The function `validate_email()` returns True if an email address is valid and False otherwise. It can be applied on a string or a column of email addresses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from dataprep.clean import validate_email\n",
    "print(validate_email('Abc.example.com'))\n",
    "print(validate_email('prettyandsimple@example.com'))\n",
    "print(validate_email('disposable.style.email.with+symbol@example.com'))\n",
    "print(validate_email('this is\"not\\allowed@example.com'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "validate_email(df[\"email\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that `validate_email()` will do the strict semantic check by default."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Raw Cell Format",
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
